Introduction to supervised text classification

Malo Jan

2024-12-22

Overview

  • Last two days: how to get text data and obtain numerical representations of texts
  • Today: how to use these numerical representations to automate the classification of texts into categories, one of the main use cases of text analysis in social sciences
  • Supervised text classification: what it is, how to do it, how to evaluate it

Text classification in social sciences

  • Long tradition in the social sciences of measuring/operationalizing concepts from text: content analysis (as distinct from interpretive approaches) (Krippendorff 2018)
  • Assign predefined categories to text units to measure a social phenomenon
  • Historically done through manual content analysis
  • Use of (multiple) experts/RAs/crowdworkers to code texts
  • Examples:

Advantages and limits of manual content analysis

  • Importance of human judgment and contextual knowledge

  • High-quality data, in comparison to automated methods such as dictionaries

  • Highly time-consuming and labor-intensive

    • E.g. the Manifesto Project, the Comparative Agendas Project (CAP)
  • Often need to rely on a small sample of texts, which can be biased

    • E.g. protest event analysis

Let the machine come in

  • Text classification can be automated with supervised machine learning
  • One of the main use cases of text analysis in social sciences
  • It “augments” the human coder; it does not replace them
  • Rather than coding all the texts manually, we train a model to predict the categories from a sample that we have coded
  • Supervised: we need a labeled dataset to train the model
  • Unsupervised: no labeled dataset is needed; the model discovers the categories by itself
  • The model learns the text features that are associated with the categories

Inputs and outputs of a supervised machine learning model

  • Input: text features + categories (labels in ML lingo)
  • Model: a function that learns the relationship between the text features and the categories
  • Output:
    • Probabilities of belonging to each category
    • With two categories (binary classification), a text is usually assigned to the positive class when its probability exceeds 0.5
    • With more than two categories (multi-class classification), the class with the highest probability is assigned
    • Multi-label classification, where a text can receive several labels, is also possible (see the toy sketch below)
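
A toy sketch of how predicted probabilities become class assignments; the category names and probabilities below are made up for illustration:

    # Toy sketch: turning predicted probabilities into class assignments.
    # The probabilities and category names are illustrative, not real output.
    probs = {"economy": 0.55, "environment": 0.30, "health": 0.15}

    # Binary case: assign the positive class if its probability exceeds 0.5
    is_positive = probs["economy"] > 0.5

    # Multi-class case: assign the class with the highest probability
    predicted = max(probs, key=probs.get)  # -> "economy"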

Use cases

Supervised text classification pipeline

  1. Get a clean corpus of texts segmented into units (e.g. documents, sentences, paragraphs)
  2. Get labelled data
    • Sample a subset of the corpus to annotate
    • Set coding rules, annotate the sample + refine codebook
  3. Get text features
    • Tokenize the texts + preprocess
    • Convert into numerical features (BOW, TF-IDF, word embeddings, BERT, etc.)
  4. Train a model
    • Split the annotated sample into a training set and a test set (e.g. a 70/30 split)
    • Choose a model (logistic regression, SVM, random forest, neural network, transformer)
    • Train a model on the training set based on text features and labels
  5. Evaluate the model
    • Evaluate the model on the test set
    • Compute performance metrics (accuracy, precision, recall, F1-score)
  6. Use the model for inference
    • Apply the model to the rest of the corpus (a minimal end-to-end sketch follows this list)
  7. Validation
  8. Use in downstream analysis
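
A minimal end-to-end sketch of steps 3-6 in Python with scikit-learn; the variable names (`texts`, `labels`, `remaining_texts`) are assumptions standing in for an annotated sample and the unannotated corpus:

    # Minimal pipeline sketch with scikit-learn. `texts` and `labels` are the
    # annotated sample; `remaining_texts` is the rest of the corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Step 4: 70/30 train/test split of the annotated sample
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.3, random_state=42)

    # Steps 3-4: TF-IDF features + logistic regression, chained in a pipeline
    pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipe.fit(X_train, y_train)

    # Step 5: evaluate on the held-out test set
    print(classification_report(y_test, pipe.predict(X_test)))

    # Step 6: apply the trained model to the unannotated corpus
    predicted = pipe.predict(remaining_texts)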

Do, Ollion, and Shen (2024): Policy vs Politics classification task

Getting labelled data

  • Training a “classifier” requires labelled data: texts with categories assigned
  • Existing labels
    • Labels with similar categories from other projects
    • Labels in the wild: metadata from datasets/websites
  • Labelling data: annotate a sample of the corpus
    • Ideally a random sample, sometimes stratified
    • Set coding rules and a codebook; check inter-coder reliability
    • Most projects use RAs or crowdworkers, which raises accuracy and inter-coder reliability concerns
    • But often, as PhD students, we do it ourselves
  • Crucial part, from experience: get to know the data

Supervised learning models
  • The goal is to learn a function that maps input text features to labels

  • To do so, we train an algorithm to predict Y (labels) from X (text features), searching for the parameters that minimize the prediction error

  • A supervised classification model thus learns the relationship between input text features and labels and returns the probability that a text belongs to each category

  • Classical models: logistic regression, Naive Bayes, SVM, random forest

    • Simple and fast
    • Rely on bag-of-words representations
    • Require substantial preprocessing
    • Limited performance
  • Transformer models (e.g. BERT)

    • Contextual representations of texts
    • No need for extensive preprocessing
    • Transfer-learning paradigm: models already come with language knowledge
    • Computationally intensive (a minimal fine-tuning sketch follows this list)
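
A minimal fine-tuning sketch with the Hugging Face transformers and datasets libraries, assuming a pandas DataFrame df with a "text" column and an integer "label" column; the checkpoint name and hyperparameters are illustrative assumptions, not prescriptions:

    # Fine-tuning sketch with Hugging Face `transformers` (assumptions: a
    # DataFrame `df` with "text" and integer "label" columns, 2 classes).
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    model_name = "distilbert-base-uncased"  # any BERT-style checkpoint works
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)

    # 70/30 split, then tokenize the texts
    data = Dataset.from_pandas(df).train_test_split(test_size=0.3)
    data = data.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

    args = TrainingArguments(output_dir="classifier", num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=data["train"], eval_dataset=data["test"])
    trainer.train()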

Evaluating performance of the model

  • The evaluation of a model is done by comparing its predictions with the true categories, a “gold standard”
  • We need a test set: texts that the model has not seen but for which we know the categories, “held-out” data
  • Accuracy: the proportion of correctly classified texts

\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

  • Accuracy is highly limited for imbalanced datasets: if 95% of texts belong to the negative class, a model that always predicts “negative” already reaches 0.95 accuracy

Confusion matrix

  • Positive class: the class we want to predict
  • Negative class: the other class

                     Predicted Positive   Predicted Negative
  Actual Positive    True Positive        False Negative
  Actual Negative    False Positive       True Negative

Recall, precision, and F1-score

  • Recall: proportion of actual positive cases that were correctly classified

\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \]

  • Precision: proportion of predicted positive cases that were correctly classified

\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \]

  • F1-score: the harmonic mean of precision and recall

\[ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
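
A small sketch computing these metrics with scikit-learn on made-up predictions:

    # Sketch: the metrics above computed with scikit-learn on toy data.
    from sklearn.metrics import (confusion_matrix, f1_score,
                                 precision_score, recall_score)

    y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # gold-standard labels (1 = positive)
    y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # model predictions

    print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
    print(precision_score(y_true, y_pred))   # TP / (TP + FP) = 2/3
    print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 2/3
    print(f1_score(y_true, y_pred))          # harmonic mean = 2/3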

What is a good performance?

  • Performance is not always satisfactory
  • What counts as a “good” performance depends on the complexity of the task
    • Conceptual complexity
    • Task complexity: multi-class, imbalanced classes
  • In general, we look for an F1-score of at least 0.7

How to improve performance

  • Reasons for poor performance, and remedies (a short sketch follows this list):
    • Imbalanced classes -> undersampling & oversampling
    • Not enough training data -> more annotation
    • Poor quality of the training data -> better annotation
    • Poor quality of the text features -> better preprocessing
    • Limited text representation -> more complex models
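
One cheap remedy for imbalance, sketched with scikit-learn; resampling with the imbalanced-learn package is a common alternative:

    # Sketch: reweight the loss so minority-class errors count more, instead
    # of (or before) resampling the training data.
    from sklearn.linear_model import LogisticRegression

    clf = LogisticRegression(max_iter=1000, class_weight="balanced")

    # Alternative: oversample the minority class with imbalanced-learn, e.g.
    # from imblearn.over_sampling import RandomOverSampler
    # X_res, y_res = RandomOverSampler().fit_resample(X_train, y_train)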

Inference

  • Once a model is trained and evaluated, it can be used for inference
  • For a supervised learning model, inference means predicting the categories of new (unseen) texts
  • E.g. with a corpus of 1 million texts and a model trained on 1,000 annotated ones, we use the model to classify all the remaining texts (as sketched below)
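
A sketch of that step, reusing the fitted `pipe` from the pipeline example above; `remaining_texts` stands in for the unannotated corpus:

    # Inference sketch: classify the unannotated corpus in batches to keep
    # memory use in check. `pipe` and `remaining_texts` are assumed above.
    predictions = []
    batch_size = 10_000
    for i in range(0, len(remaining_texts), batch_size):
        predictions.extend(pipe.predict(remaining_texts[i:i + batch_size]))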

Measurement validity

  • Face validity
  • Convergent validity

Benchmark with other methods

  • Dictionary-based methods

What to do with classification outputs?

  • It is all about measurement!
  • Often some aggregated measure of the categories is used (a sketch follows this list)
  • Descriptives: evolution over time, across countries, across groups
  • Downstream tasks:
    • Use the measurement as a DV in a regression
    • Use the measurement as an IV in a regression
  • But be careful: measurement error can bias the results
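
A typical aggregation, sketched with pandas; the DataFrame df and its "year" and "label" columns are hypothetical:

    # Aggregation sketch: share of each predicted category per year, assuming
    # a hypothetical DataFrame `df` with "year" and predicted "label" columns.
    import pandas as pd

    shares = (df.groupby("year")["label"]
                .value_counts(normalize=True)
                .rename("share")
                .reset_index())
    # `shares` is ready for descriptive plots or as a variable in a regression.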

Challenges in supervised text classification

  • Classification is only the first step
  • Still a lot of manual work
  • Imbalanced classes: active learning can help
  • Limits of bag-of-words representations
References

Bonikowski, Bart, Yuchen Luo, and Oscar Stuhler. 2022. “Politics as Usual? Measuring Populism, Nationalism, and Authoritarianism in US Presidential Campaigns (1952–2020) with Neural Language Models.” Sociological Methods & Research 51 (4): 1721–87.
Burnham, Michael. 2024. “Stance Detection: A Practical Guide to Classifying Political Beliefs in Text.” Political Science Research and Methods, 1–18.
Burscher, Bjorn, Rens Vliegenthart, and Claes H De Vreese. 2015. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize Across Contexts?” The ANNALS of the American Academy of Political and Social Science 659 (1): 122–31.
Burst, Tobias, Simon Franzmann, and Pola Lehmann. 2024. “Manifestoberta. Version 56topics.sentence.2024.1.1.” Berlin / Göttingen: Wissenschaftszentrum Berlin für Sozialforschung / Göttinger Institut für Demokratieforschung. https://doi.org/10.25522/manifesto.manifestoberta.56topics.sentence.2024.1.1.
Do, Salomé, Étienne Ollion, and Rubing Shen. 2024. “The Augmented Social Scientist: Using Sequential Transfer Learning to Annotate Millions of Texts with Human-Level Accuracy.” Sociological Methods & Research 53 (3): 1167–1200.
Krippendorff, Klaus. 2018. Content Analysis: An Introduction to Its Methodology. Sage publications.
Licht, Hauke, Tarik Abou-Chadi, Pablo Barberá, and Whitney Hua. 2024. “Measuring and Understanding Parties’ Anti-Elite Strategies.”
Licht, Hauke, and Ronja Sczepanski. 2024. “Who Are They Talking about? Detecting Mentions of Social Groups in Political Texts with Supervised Learning.” ECONtribute Discussion Paper.
Müller, Stefan, and Sven-Oliver Proksch. 2024. “Nostalgia in European Party Politics: A Text-Based Measurement Approach.” British Journal of Political Science 54 (3): 993–1005.